Abstract. Hospital profiling involves a comparison of a health care 
provider's structure, processes of care, or outcomes to a standard, of- 
ten in the form of a report card. Given the ubiquity of report cards 
and similar consumer ratings in contemporary American culture, it is 
notable that these are a relatively recent phenomenon in health care. 
Prior to the 1986 release of Medicare hospital outcome data, little such 
information was publicly available. We review the historical evolution 
of hospital profiling with special emphasis on outcomes; present a de- 
tailed history of cardiac surgery report cards, the paradigm for mod- 
ern provider profiling; discuss the potential unintended negative conse- 
quences of public report cards; and describe various statistical method- 
ologies for quantifying the relative performance of cardiac surgery pro- 
grams. Outstanding statistical issues are also described. 

1. INTRODUCTION 

Profiling involves a comparison of a health care 
provider's structure, processes of care, or outcomes 
to a normative or community standard (Gatsonis, 
1998). The results of such profiling are typically 
presented in the form of a report card, whose pur- 
pose is to quantify quality of care (Normand, 2005). 
This quality triad was first conceptualized by Avedis 
Donabedian (Donabedian, 1980), a Distinguished 
University Professor of Public Health at the Uni- 
versity of Michigan whose work was devoted to the 
study of health care quality. Structural measures include,
for example, nursing ratios, the presence of
residency programs, availability of advanced tech- 
nology and procedural volume. Such measures are 
straightforward and relatively easy to define, but 
their precise relationship to actual health outcomes 
is often difficult to quantify. For example, volume 
has been shown to be a good quality surrogate for 
many surgical procedures. However, for coronary 
artery bypass grafting (CABG) surgery, an opera- 
tion that creates a new route around a blocked por- 
tion of an artery in order to increase blood supply to 
the heart muscle, this relationship is relatively weak 
(Peterson et al., 2004; Shroyer et al., 1996; Shahian 
and Normand, 2003). 

Process measures refer to what providers do to 
and for patients. These include documented adher- 
ence to established best practices, such as the use 
of peri-operative beta blockade to reduce myocar- 
dial ischemia, or time from hospital arrival to an- 
gioplasty for acute myocardial infarction in order to 
limit irreversible heart muscle injury. Process mea- 
sures may be available for many conditions where 
outcome measures do not exist or have limited ap- 
plicability due to sample size issues or infrequent 
endpoints. Another attractive feature is that they 
are transparent and actionable, giving providers im- 
mediate guidance as to where to focus improvement 
efforts. Process measures also have the advantage 
of not requiring direct risk adjustment, but legiti- 
mate exclusions from the denominator, such as con- 
traindications to the recommended practice, must 
be recognized and documented. With the increasing 
availability of evidence-based practice guidelines, 
there has been a corresponding increase in the num- 
ber of candidate measures for process-based quality 
improvement. However, there is some concern that 
excessive emphasis on achieving compliance with pro- 
cess measures stimulated by public reporting or pay- 
for-performance initiatives might lead to unneces- 
sary screening procedures or treatments, or that they 
might conflict with a physician's best judgment or 
patient preference (Werner and Asch, 2005). Fur- 
thermore, process measures focus on only a limited 
segment of the overall care provided for a particular 
medical condition; they may be applicable to only a 
small percentage of patients with a given condition; 
and they explain a relatively small percentage of the 
variability in outcomes, which is the real interest of 
patients (Krumholz et al., 2007; Werner and Brad- 
low, 2006; Bradley et al., 2006; Rathore et al., 2003). 

Since the publication of several Institute of Medi- 
cine reports on quality, transparency and patient 
safety (Institute of Medicine, 2000, 2001), attention 
has increasingly been focused on the objective mea- 
surement of health care outcomes. Outcomes mea- 
sures refer to responses that characterize the pa- 
tient's health status and include, for example, peri- 
operative mortality, morbidity and functional sta- 
tus. Coronary artery bypass grafting surgery is one 
of the few procedures performed with sufficient fre- 
quency to justify statistical assessment of outcomes 
to quantify provider performance (Dimick, Welch 
and Birkmeyer, 2004). It is the outcomes component 
of Donabedian's triad that requires the most sophis- 
ticated statistical approach and this is the primary 
motivation for this article. 

As described by Birkmeyer, Dimick and Birkmeyer 
(2004), the choice of structural, process or outcome 
measures to assess quality is dependent on both the 
overall procedural frequency and the potential for 
serious adverse consequences. The focus of this ar- 
ticle on public profiling of hospital outcomes is not 
intended to diminish the importance of either confidential
continuous quality improvement (CQI) activities
or profiling based on process or structural
measures. For example, in cardiac surgery, substantial
improvements in overall regional CABG surgery
outcomes and a marked reduction in interprovider 
variability have been achieved through a confiden- 
tial CQI approach, which is a completely different 
paradigm than public report cards (O'Connor et al., 
1996). Furthermore, the Society of Thoracic Sur- 
geons Quality Measurement Taskforce has developed 
a multidimensional composite measure of cardiac 
surgery quality that encompasses each of the compo- 
nents of Donabedian's triad. Notwithstanding these 
caveats, in the current era of health care perfor- 
mance transparency, public accountability for out- 
comes is unquestionably of major importance to pay- 
ors, regulators and consumers. 

2. CLINICAL CONSIDERATIONS 

2.1 Background 

Measurement of health care outcomes has a long
but rather sparse history.
The basic concept of comparative profiling
has been understood for over 150 years. Florence 
Nightingale, born in Italy in 1820 of English parents, 
obtained an early education that was notable for its 
unusually strong emphasis in mathematics. She felt 
called to the profession of nursing and ultimately 
led a group of nurses to Scutari during the Crimean 
War (Cohen, 1984; Iezzoni, 1997a, 2003; Stinnett et 
al., 1990; Spiegelhalter, 1999). Conditions in British 
hospitals were deplorable, and Nightingale was able 
to accumulate extensive data on the causes of death 
among British soldiers, subsequently displaying these 
outcome data in "coxcomb" or polar area charts. 
These demonstrated that diseases related to poor 
sanitary conditions in such hospitals killed many 
times more soldiers than war wounds. Upon her re- 
turn to England, she continued her analytical stud- 
ies by comparing mortality statistics among London 
hospitals (Iezzoni, 1997a, 2003). Significant disparities
in outcomes were noted, many of which were
felt to be due to overcrowding and generally unsan- 
itary conditions. In various articles, she also noted 
the importance of accounting for patient status on 
admission, presaging current efforts to account for 
case mix through risk adjustment. Finally, she also 
noted that some hospitals intentionally discharged 
terminal patients, only to have their death occur in 
another institution. This practice unfairly biased in- 
stitutional performance comparisons, and it is one 
form of outcomes gaming (Shahian et al., 2001). For 
these and numerous other reasons, contemporary researchers
have proposed that in-hospital mortality
is an inadequate measure of performance and that 
30-day follow-up or even longer should be consid- 
ered the norm (Osswald et al., 1999; Shahian et al., 
2001). Because of her contributions to applied statis- 
tics and epidemiology, with particular emphasis on 
health care improvement, Nightingale was made a 
Fellow of the Royal Statistical Society and an Honorary
Member of the American Statistical Association,
both rare honors for a woman of her era (Cohen,
1984; Iezzoni, 1997a, 2003; Spiegelhalter, 1999;
Stinnett et al., 1990). 

Unquestionably, the most significant early pioneer 
in the area of outcomes measurement, analysis and 
public report cards was Dr. Ernest Amory Codman, 
a turn-of-the-century Boston Brahmin who gradu- 
ated from Harvard College and Harvard Medical 
School (Iezzoni, 1997a, 2003; Passaro and Organ, 
1999; Spiegelhalter, 1999; Mallon, 2000). Like Flo- 
rence Nightingale, he had always displayed a passion 
for objective quantitative data, and he maintained 
a log of shells expended to birds shot when hunt- 
ing (Passaro and Organ, 1999). While Nightingale's 
approach was more epidemiologic, Codman focused 
more on surgical audit of individual cases, classifica- 
tion of surgical errors and comparison of outcomes 
(Spiegelhalter, 1999). He developed the first anes- 
thesia chart as well as a unique system of classifying 
surgical errors. Although on the surgical staff at the 
Massachusetts General Hospital, he started his own 
hospital on Pinckney Street in Boston in 1911 and 
maintained meticulous records of both short- and 
long-term outcomes. His incessant pleas for hospitals 
to maintain accurate records of patient outcomes, 
to accept responsibility for these outcomes and to 
publicize them were not well received. Together with 
his suggestion that the seniority system for selecting 
surgical leadership should be eliminated, this ulti- 
mately led to his resignation from the staff at the 
Massachusetts General Hospital and his alienation 
from much of the Boston medical community. He 
nonetheless persevered and was instrumental in the 
development of the American College of Surgeons. 
He served as Chair of its Committee on Hospital 
Standardization, a forerunner of the Joint Commis- 
sion on Accreditation of Healthcare Organizations 
(Passaro and Organ, 1999; Mallon, 2000). 

It is to giants such as Florence Nightingale and 
Ernest Amory Codman, considered by many to be 
iconoclasts in their own day, that our modern approach
to outcomes analysis owes an incalculable debt.



2.2 Cardiac Surgery Profiling 

The modern era of publicly profiling institutional 
performance began not in health care but in edu- 
cation, most notably in Great Britain (Aitkin and 
Longford, 1986; Goldstein and Spiegelhalter, 1996; 
Jaeger, 1989). Studies of school effectiveness antic- 
ipated many of the statistical issues and controver- 
sies currently being debated in health care. Profiling 
initiatives in the latter area arguably began in 1986 
with the first annual report of hospital-level data 
for 17 broad diagnostic and procedure groups, re- 
leased by the Health Care Financing Administration 
(HCFA), now known as CMS, the Centers for Medicare
and Medicaid Services. CMS is the U.S. federal
agency that administers the Medicare program and 
works with states to administer Medicaid and the 
State Children's Health Insurance Program. Hospi- 
tals in the 1986 report with higher than expected 
mortality rates were classified as having potential 
quality problems. These reports were widely criticized
for their failure to adequately account for patient
severity, not to mention numerous other anomalies
such as high mortality rates attributed to a hospice
(Berwick and Wald, 1990; Iezzoni, 1997a; Kassirer,
1994). The public release of hospital report
cards was suspended in 1994 (Kassirer, 1994). None- 
theless, certain specialties recognized that, however 
flawed, this report signaled the beginning of a new 
era of increased accountability in health care that 
could not be ignored. Cardiac surgery was the first 
and most prominent of these specialties, and it has 
unquestionably become the paradigm for modern 
health care outcomes measurement. Coronary artery 
bypass grafting (CABG) surgery is the focus of many 
profiling efforts. It is the most commonly performed 
major complex surgical procedure, it is costly, and 
it has well-defined endpoints including serious com- 
plications and death (Shahian et al., 2001). 

Soon after the publication of the HCFA report 
card, the Society of Thoracic Surgeons established 
an Ad Hoc Committee on Risk Factors for CABG, 
and the Society also began work on development 
of what ultimately became the Society of Thoracic 
Surgeons National Cardiac Database (STS NCD), 
which was released to its membership in 1990. Dur- 
ing the same time period, other seminal studies of 
risk-adjusted cardiac surgery outcomes demonstrated 
unexpected and significant variability in outcomes 
not accounted for by case mix. In an analysis of 7596 
New York State patients who had undergone open 
heart surgery during the first six months of 1989,
Hannan et al. (1990) noted that unadjusted institutional
mortality rates varied from 2.2% to 14.3%,
which was a much greater range than expected. The 
Northern New England Cardiovascular Study Group 
found that among five centers and 18 cardiothoracic 
surgeons who performed CABG surgery in Maine, 
New Hampshire and Vermont between 1987 and 
1989, the unadjusted in-hospital mortality varied 
from 3.1% to 6.3% among hospitals, and a signif- 
icant difference was noted even after risk adjust- 
ment (O'Connor et al., 1991). Williams, Nash and 
Goldfarb (1991) studied CABG results from 1985 
to 1987 at five Philadelphia teaching hospitals and 
found a twofold variation in mortality for DRG 106 
(coronary artery bypass surgery without coronary 
catheterization) that was not accounted for by the 
limited risk adjustment that was available. 

These early findings stimulated the further ap- 
plication of statistical risk models to assist in (a) 
the identification of pre-operative factors affecting 
CABG surgery outcomes; (b) patient counseling; (c) 
procedure selection; (d) outcomes assessment and 
profiling; and (e) continuous quality improvement 
activities (Shahian et al., 2001, 2004). Subsequently, 
similar models have been developed for post-cardiac 
surgery complication rates and also for valvular heart 
surgery, congenital heart surgery and general tho- 
racic surgery (Shahian et al., 2004). In addition to 
the STS NCD, other excellent risk models continue 
to be employed by the New York Cardiac Surgery 
Reporting System (CSRS), the Northern New England
Cardiovascular Disease Study Group (NNE),
the Veterans Affairs Administration and a European 
consortium (Shahian et al., 2001, 2004). 

By far the most controversial use of risk models 
has been for the determination and comparison of 
risk-adjusted mortality rates for hospitals and indi- 
vidual surgeons. Typically, the probabilities of death 
for all of a provider's patients are estimated from 
logistic regression, aggregated, and compared with 
the observed number of deaths, usually by means 
of a ratio of the observed number to the expected 
number. This may then be multiplied by the overall 
unadjusted mortality for a state or region, yielding 
a so-called risk-standardized mortality rate. Numer- 
ous statistical concerns have been expressed regard- 
ing this approach, including the inaccuracy of esti- 
mates from low-volume providers with small sam- 
ple sizes, clustering (nonindependence) of patients 
among providers, and multiple comparisons (Thomas,
Longford and Rolph, 1994; Goldstein and Spiegelhalter,
1996; Christiansen and Morris, 1997; Normand,
Glickman and Gatsonis, 1997; Shahian et al.,
2001, 2004). 
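
The observed-to-expected calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular registry: the per-patient probabilities, the observed death count and the regional rate are hypothetical values, and the function name `risk_standardized_rate` is our own.

```python
def risk_standardized_rate(predicted_probs, observed_deaths, overall_rate):
    """Return (expected deaths, O/E ratio, risk-standardized mortality rate).

    predicted_probs : per-patient probabilities of death from a fitted risk
                      model such as logistic regression (hypothetical here).
    observed_deaths : number of deaths actually observed for the provider.
    overall_rate    : unadjusted mortality rate for the state or region.
    """
    # Expected deaths: the sum of the patients' predicted probabilities.
    expected_deaths = sum(predicted_probs)
    if expected_deaths <= 0:
        raise ValueError("expected deaths must be positive")
    oe_ratio = observed_deaths / expected_deaths
    # Scaling O/E by the overall rate gives the risk-standardized rate.
    return expected_deaths, oe_ratio, oe_ratio * overall_rate


# Hypothetical provider: 200 patients, mostly low risk, a few high risk.
probs = [0.01] * 150 + [0.03] * 40 + [0.15] * 10
expected, oe, rate = risk_standardized_rate(
    probs, observed_deaths=6, overall_rate=0.022
)
print(f"expected deaths = {expected:.2f}, O/E = {oe:.2f}, rate = {rate:.2%}")
```

A ratio above 1 indicates more deaths than the case mix would predict. The sketch deliberately ignores the concerns listed above (small samples, clustering and multiple comparisons).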

Few published studies have correlated risk-stan- 
dardized mortality rates with objective or subjective 
expert assessment of quality. In hospital site visits, 
Daley et al. (1997) found differences in processes and 
delivery of surgical care in Veterans Affairs Medical 
Centers that correlated with and corroborated sta- 
tistical measures of high hospital risk-adjusted mor- 
tality and morbidity. In a recent study of hospital 
process measures and in-hospital mortality among 
patients with acute coronary syndromes, Peterson 
et al. (2006) found significant associations between 
use of needed therapies and mortality. 

2.2.1 Data quality. In any profiling initiative, the 
quality of the data is more important than choice of 
statistical models. Clear and concise definitions for 
data elements are exceedingly important, especially 
for those variables that are most highly predictive 
of mortality. Coding accuracy may significantly af- 
fect risk-adjusted outcome results. A prospectively 
maintained clinical database containing core clini- 
cal variables is the best data source for profiling 
(Shahian et al., 2001, 2004; Krumholz et al., 2006a). 

Administrative claims data, consisting of demographic,
diagnosis and procedural codes, are derived
primarily from insurance claims. These data are read- 
ily available for millions of patients, but they are not 
collected with the primary goal of assessing risk- 
adjusted patient outcomes. Cases may be missed 
or misclassified, and important but nonreimbursable 
diagnoses may be excluded (Krumholz et al., 2006a; 
Iezzoni, 1997a, b, 2003; Shahian et al., 2001, 2004). 
In a recent study by Mack et al. (2005), CABG cases 
at one Texas hospital were analyzed to determine 
whether there was agreement between the results 
from an audited clinical database (STS NCD) and 
federal and state administrative databases. There 
were significant disparities in both the volumes of 
cases and the unadjusted mortality rates, with administrative
data significantly overstating the latter.
Similar case misclassification errors were observed in 
Massachusetts when the results from a carefully au- 
dited registry using clinical data (STS NCD) were 
compared with those from a state administrative 
database (Shahian et al., 2007). 

Similarly, separation of pre-operative co-morbidities 
(case-mix adjustors) from post-operative complica- 
tions is problematic when using administrative data 
sources (Iezzoni, 1997a, b, 2003). Such misclassification
may lead to a provider in essence receiving
credit for operating upon a patient with a serious 
pre-existing condition, when in fact that condition 
was a major complication of care. For example, poor 
surgical care may lead to pulmonary complications 
such as pneumonia; inclusion of pneumonia in the 
statistical model would actually "adjust away" the 
outcome. Conversely, misclassifying a true pre-oper- 
ative risk factor as a complication fails to account 
for case-mix acuity, thus disadvantaging a provider's 
risk-adjusted outcomes. This critical deficiency of 
administrative databases has led to the development 
of numerous computerized algorithms for correctly 
classifying secondary diagnoses, none of which is 
foolproof and all of which are inferior to the defini- 
tive classification available in a clinical database. 
When only administrative data are available, date 
stamping or "present at admission" indicators are 
the best compromise solution for correctly classi- 
fying secondary diagnoses. The Office of Statewide 
Health Planning and Development in California, for 
example, creates inpatient hospital discharge data 
files that separate conditions present on admission 
from those present on discharge. 

A comprehensive analysis of the impact of co- 
morbidity undercoding has recently been conducted 
by Austin et al. (2005) using Monte Carlo simula- 
tion techniques. Although the assignment of outlier 
status was relatively robust to undercoding of sever- 
ity and co-morbidities, miscoding of very influential 
predictors, such as shock or renal failure, could lead 
to hospital misclassification. 

To assess the practical impact of using admin- 
istrative data for profiling, Hannan and associates 
compared New York CABG results determined from 
their dedicated clinical database (CSRS) with those 
derived from the New York administrative database 
(Hannan et al., 1992) and the federal Medicare ad- 
ministrative database (Hannan et al., 1997a). Not 
surprisingly, models based upon clinical data pro- 
vided superior discrimination and accuracy in ex- 
plaining variations in patient mortality. Models de- 
rived from administrative data had significantly im- 
proved performance when a few critical clinical vari- 
ables were added, such as ejection fraction, 
re-operation or left main coronary artery disease. 
Studies from the Cardiac Care Network of Ontario 
(Tu, Sykora and Naylor, 1997), the Cooperative 
CABG Database Project (Jones et al., 1996) and the 
STS NCD (Shahian et al., 2004) suggest that a few
critical core variables provide much of the important
predictive information in any cardiac surgery
database. 

Notwithstanding these legitimate concerns, for cer- 
tain conditions administrative data may be suffi- 
cient to report on some outcomes. For example, in 
studies of acute MI and heart failure, Krumholz 
et al. (2006b, c) found that while models based on 
medical record data had better discrimination be- 
tween survival and mortality for individual patients 
than administrative data, there were not many dif- 
ferences between the hospital-specific standardized 
risk-adjusted rates using these two data sources. 

Inclusion of demographic and socio-economic sta- 
tus (SES) variables as adjustors in the model raises 
concerns similar to those related to inclusion of com- 
plications. Disparities in outcomes among racial/ 
ethnic groups may be due to system-level factors, 
such as financing, structure of care and cultural- 
linguistic barriers; patient-level factors, such as pref- 
erences and biological differences; or physician or 
provider factors, such as bias (Institute of Medicine, 
2003). Inclusion of race/ethnicity would only make 
sense in a profiling context if there were biological 
differences in survival or patient preferences that im- 
pacted survival among different racial/ethnic groups. 
In CABG surgery, for example, because women are 
more likely to have smaller vessels that are techni- 
cally more difficult to bypass, sex is included in the 
mortality model. However, adjusting for race/ethnic- 
ity may in fact unfairly mask those institutions that 
have poor systems of care such as hospitals that lack 
translators or hospitals that provide suboptimal care 
to patients with fewer financial resources. SES vari- 
ables may be used to help understand differences in 
quality, such as access barriers, but they should not 
be used to quantify deficiencies in hospital perfor- 
mance. 

2.2.2 Audit and validation. Comprehensive data 
audit and validation are critical to any profiling ef- 
fort, and their absence is a significant theoretical 
deficiency of many voluntary initiatives and of reg- 
istries based on administrative data. We illustrate 
the importance of these measures with a descrip- 
tion of the processes employed in the implementa- 
tion of the first Massachusetts public report card 
for cardiac surgery (Shahian, Torchiana and Nor- 
mand, 2005). Like New York, New Jersey and Penn- 
sylvania (Shahian et al., 2001), Massachusetts has 
a mandated surveillance program for invasive cardiac
services that includes public reporting of hospital
mortality (Shahian, Torchiana and Normand,
2005). The first cardiac surgery report card in Mas- 
sachusetts was based upon data from 2002, during 
which time frame there were 4603 isolated CABG 
patients available for analysis (Table 1), distributed 
among 11 established cardiac surgery programs and 
two new programs (Hospitals 5 and 10). The first 
cardiac surgery was performed at Hospital 10 dur- 
ing April 2002 and at Hospital 5 that August. These 
outcome data were collected, cleaned, audited, val- 
idated and analyzed at a central data coordinating 
center (www.massdac.org). Data submissions were 
validated using both state administrative and vital 
statistics databases. One hundred and fourteen data 
quality reports were given to hospitals with the ex- 
pectation that problematic data would be corrected 
or appropriate documentation provided. An audit 
of 500 cases including all deaths was performed by 
the local Quality Improvement Organization. A sec- 
ond audit of 724 charts was performed by an expert 
Adjudication Committee consisting of three senior 
Massachusetts cardiac surgeons. This focused pri- 
marily on data elements that were particularly im- 
portant in the risk model (e.g., urgent and emergent 
status, cardiogenic shock, etc.) as well as cases coded 
as CABG plus other, a category potentially used 
to hide mortalities. Additional documentation that 
was typically requested by the Adjudication Com- 
mittee included history and physical examinations, 
progress notes, operative notes, ICU flow charts and 
discharge notes. Eight hundred and thirty-five changes 
were made by the committee, each of which required 
unanimous agreement. 

2.3 Report Card Controversies 

Aside from issues relating to implementation (type 
of database, inclusion of critical core variables, audit 
and validation, and the selection and development 
of statistical models), there are numerous philosophical
and practical concerns regarding both the efficacy
and potential unintended negative consequences
of report cards (Shahian et al., 2001, 2004; Werner 
and Asch, 2005). Report cards provide transparency 
and public accountability, which are perhaps suffi- 
cient justification for their existence (Shahian et al., 
2001). However, the market-based assumption that 
consumers will seek to choose the best providers and 
that providers will respond by improving their quality
is as yet unproven (Shahian et al., 2001; Werner
and Asch, 2005). Furthermore, public report cards
may be an incentive for certain behaviors that actually
decrease overall health care quality (Dranove
et al., 2003).

Table 1

30-day mortality in 13 nongovernmental hospitals following
isolated CABG surgery, Massachusetts, USA

Cardiac surgery   Number     Number (%)    Expected
program           of cases   of deaths     mortality %
(1)               (2)        (3)           (4)

1                 508        11 (2.17)     2.01
2                 454        11 (2.42)     2.58
3                 381        15 (3.94)     2.94
4                 623        11 (1.77)     2.30
5                 26         0 (0.00)      1.10
6                 393        7 (1.78)      2.15
7                 718        18 (2.51)     2.20
8                 149        1 (0.67)      1.45
9                 80         0 (0.00)      0.87
10                296        5 (1.69)      1.99
11                191        3 (1.57)      1.71
12                365        4 (1.10)      1.87
13                419        15 (3.58)     1.91
All               4603       101 (2.19)

Data correspond to surgeries performed in adults during the
period January 1, 2002 through December 31, 2002.

(2): Number of admissions in which the first surgery was an
isolated CABG surgery.

(3): Number (percent) of observed 30-day mortalities.

(4): Rates expected using estimates of the association between
risk factors and mortality, ignoring hospital effects [Section
3.3, (6)].
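
As a rough illustration of how the figures in Table 1 relate to one another, the following Python sketch computes an observed-to-expected (O/E) ratio for each program and shows how far that ratio would move if a program had recorded one additional death. It is illustrative arithmetic only, not the analysis used for the official report card. The death counts for programs 5 and 9 are taken to be zero, an assumption inferred from the stated total of 101 deaths, and the helper name `oe_ratio` is our own.

```python
# (program, cases, observed 30-day deaths, expected mortality %) from Table 1.
# Assumption: deaths for programs 5 and 9 are set to 0, inferred from the
# stated overall total of 101 deaths.
TABLE_1 = [
    (1, 508, 11, 2.01), (2, 454, 11, 2.58), (3, 381, 15, 2.94),
    (4, 623, 11, 2.30), (5, 26, 0, 1.10), (6, 393, 7, 2.15),
    (7, 718, 18, 2.20), (8, 149, 1, 1.45), (9, 80, 0, 0.87),
    (10, 296, 5, 1.99), (11, 191, 3, 1.71), (12, 365, 4, 1.87),
    (13, 419, 15, 1.91),
]
OVERALL_RATE_PCT = 2.19  # unadjusted statewide 30-day mortality (Table 1)


def oe_ratio(cases, deaths, expected_pct):
    """Observed deaths divided by expected deaths (cases * expected rate)."""
    expected_deaths = cases * expected_pct / 100.0
    return deaths / expected_deaths


for program, cases, deaths, expected_pct in TABLE_1:
    oe = oe_ratio(cases, deaths, expected_pct)
    oe_plus_one = oe_ratio(cases, deaths + 1, expected_pct)
    print(
        f"program {program:>2}: n={cases:>3}  O/E={oe:4.2f}  "
        f"rate={oe * OVERALL_RATE_PCT:4.2f}%  "
        f"O/E with one more death={oe_plus_one:4.2f}"
    )
```

For program 8 (149 cases), a single additional death doubles the ratio, whereas for program 7 (718 cases) the change is small. This is one way to see why simple ratios for low-volume programs are unstable.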

2.3.1 Impact on mortality. Although there was a 
substantial decline in New York State cardiac surgery 
mortality that coincided with the introduction of 
public report cards, it is unclear whether publication 
of results was the primary mechanism (Shahian et 
al., 2001). Collecting and analyzing their own data 
forces hospitals to confront inferior results, and to 
institute changes to procedures and staff long be- 
fore such results could ever be disseminated to the 
public. Furthermore, the decline in New York CABG 
mortality occurred during a period when these same 
rates were falling nationally. Ghali et al. (1997) com- 
pared the mortality decline in New York with the re- 
sults from a comparable time frame in northern New 
England and Massachusetts and found little differ- 
ence in the magnitude of change. Northern New 
England had a voluntary, confidential approach to 
continuous quality improvement (CQI) with no public
reporting, and Massachusetts had strong academic
and clinical centers but no public report cards.
In a study of Medicare CABG patients, Peterson et 
al. (1998) reviewed results for all states and regions 
from 1987 to 1992. The only geographic region with 
a level of outcomes improvement and low absolute 
CABG mortality level comparable to New York's 
was northern New England, which as noted has a 
totally confidential CQI approach. When the first 
official Massachusetts report card was published in 
2004, employing highly audited clinical data and 
a sophisticated statistical model, it was observed 
that the 2002 unadjusted mortality rate for CABG 
was 2.19%, arguably one of the lowest overall state 
CABG mortality rates ever reported. This was in 
a state which had never before had a public report 
card. Such observations suggest that report cards, 
although useful for public accountability, are only 
one of many motivating factors for high quality, and 
perhaps not essential. 

2.3.2 High-risk case avoidance. Analysis of New 
York clinical CABG data by Hannan et al. (1997b) 
suggests that modern risk-adjustment algorithms do 
in fact adequately protect providers who undertake 
the care of high-risk patients. However, despite the 
availability of such sophisticated CABG risk mod- 
els, the threat of public disclosure of results has in- 
evitably resulted in more selective acceptance of pa- 
tients by cardiac surgeons. Omoigui et al. (1996) 
noted that after the introduction of report cards in 
New York State, more New York patients with high- 
risk characteristics were sent to the Cleveland Clinic 
for cardiac surgery than had been referred during 
the pre-report period. Furthermore, these New York 
State patients at the Cleveland Clinic had the high- 
est expected mortality of any referral group there, 
and referrals from New York to Cleveland increased 
during the post-report card period in contrast to all 
other states. Subsequent studies by Chassin, Han- 
nan and DeBuono (1996) and Peterson et al. (1998) 
have challenged this out-migration hypothesis. How- 
ever, it is undeniable that many surgeons perceive 
that accepting such high-risk patients may jeopar- 
dize their reputations and referrals. In a study of 
Pennsylvania cardiac surgeons by Schneider and Ep- 
stein (1996), 63% reported that they were less will- 
ing to operate on severely ill patients subsequent to 
the introduction of report cards. Furthermore, 59% 
of cardiologists reported increased difficulty finding 
surgeons willing to accept such high-risk patients. 



Burack et al. (1999) found that high-risk CABG pa- 
tients in New York, whose results would be pub- 
licly reported, were more likely to be refused surgery 
than were similar high-risk patients with aortic dis- 
section, another type of cardiac surgery for which 
results are not reported. In that study, 62% of car- 
diac surgeons reported that they had refused to op- 
erate on at least one high-risk patient during the 
preceding year because of the fear of public report- 
ing. Numerous solutions to this problem have been 
recommended, such as the exclusion of high-risk pa- 
tients from reporting, compiling data on cardiac pa- 
tients from the time of initial referral in order to 
track inappropriate denials of surgical care, and the 
collection of other quality indicators in addition to 
mortality (e.g., morbidity, quality of life and func- 
tional improvement) (Shahian et al., 2001). There 
is no question that focusing on public reporting of 
mortality will result in some biasing of patient selec- 
tion and may deny surgical intervention to the very 
group of high-risk patients who might benefit most 
(Jones, 1989). Furthermore, patients denied appro- 
priate CABG may be subjected to less effective and 
cumulatively more costly therapies, leading to both 
higher societal costs and overall population mortal- 
ity rates (Jones, 1989; Dranove et al., 2003). 

2.3.3 Gaming. Diagnostic Related Groups (DRG)- 
based reimbursement strategies led to the develop- 
ment of "DRG creep," in which institutional cod- 
ing practices changed in order to maximize hospi- 
tal reimbursement. Similarly, when faced with the 
prospect of public outcomes reports that may im- 
pact licensure, referrals and pay-for-performance re- 
imbursement, surgeons and institutions may attempt 
to "game" the outcome reporting system. For ex- 
ample, by inappropriately coding pre-operative co- 
morbidities, especially those like emergency status 
or cardiogenic shock that are highly predictive of 
operative mortality, a provider's expected mortality 
is increased, and their O/E ratio and risk-adjusted
mortality rate decrease (Greene and Wintfeld, 1995; 
Parsonnet, 1995; Shahian et al., 2001). Careful audit 
is essential to detecting and discouraging such prac- 
tices, particularly when the frequency of co-morbid- 
ities for a particular institution is out of the usual 
range. Change of operative class is another form of 
gaming. If only isolated CABG procedures are pub- 
licly reported, a surgeon who anticipates a bad pa- 
tient outcome may add a relatively trivial additional 
component to the operation, such as closure of a
patent foramen ovale. This would remove the procedure
from the isolated CABG category which is
publicly reported, and shift it into the CABG plus 
other category which is unreported. Careful audit 
of all CABG plus other cases is essential to detect- 
ing such practices, and it is also critical to define 
prospectively what types of cases truly justify an 
other designation. For example, in Massachusetts it 
was decided that CABG plus closure of a patent 
foramen ovale would be coded as an isolated CABG. 
It is a relatively trivial procedure that does not re- 
ally change the expected mortality, but it could the- 
oretically be applied to a significant percentage of 
the CABG population. 

For purposes of outcome reporting, it is also im- 
portant to follow patients for at least 30 days fol- 
lowing major surgery, if not longer (Osswald et al., 
1999). With modern technology, many patients who 
are very seriously ill after CABG may be kept alive 
for weeks or months, only to ultimately succumb to 
what are clearly complications of the operation. Fi- 
nally, in order to discourage pre-terminal transfer 
of patients to other facilities in order to hide their 
anticipated deaths, report cards should include all 
patients who die within 30 days, regardless of cause 
or venue. 

2.3.4 Consumer choice. Interestingly, despite the 
public accountability afforded by public report cards, 
there has been little objective evidence that this 
valuable information has redirected patients from 
high-mortality to low-mortality institutions. This 
was apparent in an early study by Vladeck and as- 
sociates (Vladeck et al., 1988) following release of 
HCFA Medicare mortality data. The authors con- 
cluded that long-standing referral preferences, tra- 
dition, convenience and personal recommendation 
were more important than objective information. 
Similarly, Blendon et al. (1998) found that the rec- 
ommendation of family and friends trumped objec- 
tive data in choosing health care providers. Fin- 
layson et al. (1999) determined that many patients 
preferred local care over demonstrably better care at 
regional referral centers, a strong geographic prefer- 
ence also shown by Shahian et al. (2000) for CABG 
surgery in Massachusetts. In Cleveland, an effort 
funded by local businesses to monitor and report 
quality of care did not demonstrate any measurable 
effect on consumer choice (Burton, 1999). Schneider 
and Epstein (1996, 1998) studied the responses of 
both cardiac surgery patients and cardiologists following
release of the Pennsylvania cardiac surgery
report card. They found that few patients were aware
of the report card or knew their surgeon's rating 
prior to surgery. Few regarded it as important in 
their choice of a provider and few cardiologists felt 
that it had significant impact on their referral rec- 
ommendations. Hannan et al. (1997c) found that 
only 38% of New York cardiologists thought that 
report cards had substantially impacted their re- 
ferral patterns, despite regarding these report cards 
as readable and reasonably accurate. Furthermore, 
there was no shift in the market share or percentage
of New York CABG patients who had surgery at 
high-mortality versus low-mortality hospitals after 
the introduction of report cards (Chassin, Hannan 
and DeBuono, 1996; Jha and Epstein, 2006). 

It might be expected that payers, having easier 
access to outcome data, would be better able to di- 
rect patients to high-quality providers. However, in 
separate studies by Shahian and associates in Mas- 
sachusetts (2000) and Erickson and colleagues in 
New York (2000), using completely different data- 
bases and statistical methodologies, both groups 
found ironically that managed care patients were 
less likely to have surgery at lower mortality hos- 
pitals. In general, promoting consumerism in health 
care has not been successful thus far. Consumer- 
driven health plans, which involve insurance arrange- 
ments that give employees greater choice among ben- 
efits and providers, but also expose them to greater 
financial risk, are the latest idea in health insurance. 
Early results indicate that beneficiaries in consumer- 
driven health plans have lower satisfaction, higher 
out-of-pocket costs and more missed health care than 
consumers in more comprehensive health insurance 
(Fronstin and Collins, 2005). 

3. STATISTICAL ISSUES AND METHODS 

3.1 Historical Approaches 

Early attempts at measuring quality of care were 
based on tests of excess variation. These were soon 
replaced by methods that estimated quality mea- 
sures and then attempted to identify poorly per- 
forming providers through tail probabilities. The first 
widely disseminated summary of variability in the 
quality of health care appeared in a 1973 Science ar- 
ticle that examined medical and surgical rates across 
193 hospital areas in New England (Wennberg and 
Gittelsohn, 1973). The authors quantified variation 
in dispensation of health care by the ratio of the
maximum to the minimum rate, denoted the extremal
quotient.

An obvious problem with the extremal quotient 
occurred in small areas where the minimum observed 
rate can often be zero. Other indices of variability 
emerged, including the coefficient of variation, CV 
(Chassin et al., 1986), and the systematic compo- 
nent of variation, SCV (McPherson et al., 1982). 
The former was used to quantify overall variation, 
while the SCV, defined as the difference between the 
total observed variation and the within-institution 
variation, was used as an estimate of interinstitution 
variability. However, these measures were shown to 
be sensitive to the number of institutions, the per- 
institution sample size and the underlying rate (Diehr 
et al., 1990). 
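
The two simplest indices above can be sketched in a few lines. This is an illustrative sketch only: the area rates are hypothetical, the function names are ours, and the coefficient of variation is computed here with the population standard deviation, which is an assumption rather than the definition used in the cited papers.

```python
from statistics import mean, pstdev


def extremal_quotient(rates):
    """Ratio of the maximum to the minimum rate.

    Returns infinity when the minimum observed rate is zero, which is the
    small-area problem described in the text.
    """
    lowest = min(rates)
    if lowest == 0:
        return float("inf")
    return max(rates) / lowest


def coefficient_of_variation(rates):
    """Standard deviation of the rates divided by their mean (assumed form)."""
    return pstdev(rates) / mean(rates)


# Hypothetical procedure rates per 10,000 residents in five areas.
area_rates = [20.0, 25.0, 30.0, 35.0, 40.0]
eq = extremal_quotient(area_rates)
cv = coefficient_of_variation(area_rates)

# A single small area with no observed procedures makes the quotient unusable.
eq_with_zero = extremal_quotient([0.0] + area_rates)
```

The sketch makes the weakness of the extremal quotient concrete: it depends only on the two most extreme areas, so one zero rate drives it to infinity regardless of the other observations.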

The first HCFA report contained observed and 
expected mortality data for all acute-care nongovernmental
hospitals. Hospitals with higher-than-expected
mortality rates were classified as underperforming
institutions. Let $Y_{ij}$ denote the outcome for the
$j$th patient treated at the $i$th institution, let $\mathbf{x}_{ij}$ be
a vector of patient-specific characteristics and let
$n_i$ be the number of cases treated at institution $i$,
where $i = 1, 2, \ldots, I$. Each hospital-specific mortality
rate was calculated as the observed number of
deaths at the institution divided by the number of
cases, $\bar{y}_i = \frac{1}{n_i} \sum_j y_{ij}$. The expected mortality rate for
a patient was modeled assuming

(1) $Y_{ij} \stackrel{\text{ind}}{\sim} \text{Bern}(p_{ij})$, where $\text{logit}(p_{ij}) = \alpha_0 + \boldsymbol{\alpha}_1 \mathbf{x}_{ij}$.

The covariates, $\mathbf{x}_{ij}$, were obtained from administrative
claims data. The expected mortality rate in institution
$i$ was calculated as

(2) $\hat{y}_i = \frac{1}{n_i} \sum_j \text{logit}^{-1}(\hat{\alpha}_0 + \hat{\boldsymbol{\alpha}}_1 \mathbf{x}_{ij})$

and compared to observed mortality using

(3) $z_i = \frac{\bar{y}_i - \hat{y}_i}{\sqrt{\text{Var}(\bar{y}_i - \hat{y}_i)}}$,

where $\text{Var}(\bar{y}_i - \hat{y}_i)$ was approximated using a Taylor
series expansion (Mood, Graybill and Boes, 1973).
Hospitals with $z_i > 1.645$ were identified as outliers
having higher than expected mortality.
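
The observed-versus-expected comparison can be sketched as follows. This is a simplified illustration, not the HCFA procedure: the function names and inputs are hypothetical, and the variance treats the fitted coefficients as known, so it uses only the binomial variance of the outcomes rather than the Taylor series approximation mentioned above.

```python
import math


def inv_logit(value):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-value))


def hospital_z_score(outcomes, covariates, alpha0, alpha1):
    """z-score comparing observed and expected mortality at one hospital.

    outcomes   : list of 0/1 deaths, one per patient
    covariates : list of covariate vectors, one per patient
    alpha0     : fitted intercept (assumed known here)
    alpha1     : fitted coefficient vector (assumed known here)
    """
    n = len(outcomes)
    probs = [
        inv_logit(alpha0 + sum(a * x for a, x in zip(alpha1, xs)))
        for xs in covariates
    ]
    observed_rate = sum(outcomes) / n
    expected_rate = sum(probs) / n
    # Simplifying assumption: ignore uncertainty in the fitted coefficients.
    variance = sum(p * (1.0 - p) for p in probs) / n**2
    return (observed_rate - expected_rate) / math.sqrt(variance)


# Hypothetical hospital: four patients, all died, each with predicted risk 0.5.
z = hospital_z_score([1, 1, 1, 1], [[0.0]] * 4, alpha0=0.0, alpha1=[0.0])
flagged = z > 1.645
```

Because the coefficient uncertainty is ignored, this variance is smaller than the one the text describes, so the sketch would tend to flag hospitals somewhat more readily.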

In response to the earlier criticisms of the first
mortality reports, several initiatives have been undertaken
by HCFA (and now by CMS) to streamline
in-depth data collection in order to better risk-adjust
for case-mix differences. Nonetheless many
report cards continue to use the observed data to
calculate tail probabilities, for example, calculating
$(\bar{y}_i / \hat{y}_i) \times \bar{y}$, a corresponding 95% CI, and classifying an
institution as outlying if its 95% interval excludes $\bar{y}$.

3.2 Modern Approaches 

Statistical researchers criticized the methodology 
utilized by HCFA/CMS on various methodological 
grounds (Thomas, Longford and Rolph, 1994; Gold- 
stein and Spiegelhalter, 1996; Normand, Glickman 
and Gatsonis, 1997). The criticisms related to a lack 
of attention paid to the sampling variability due to 
large differences in the number of cases per hospi- 
tal, ignoring the statistical dependence among out- 
comes within a hospital, failing to estimate inter- 
and intrahospital variance components, and utiliza- 
tion of a classification system that labels a predeter- 
mined number of hospitals as having quality prob- 
lems when excess mortality could be due to random 
error. Table 1 demonstrates many of these prob- 
lems using the Massachusetts data. The number of 
cases varies by an order of magnitude as do the 
observed mortality rates. If we assume an average 
expected mortality rate of 2.19%, then observing 
no mortalities at the new programs is not surpris- 
ing. For example, in Hospital 5 with 26 cases the 
probability of no deaths is 0.56 and in Hospital 9 
with 80 cases it is 0.17. On the other hand, there 
are 15 mortalities at Hospital 3. Ignoring case-mix, 
the probability of observing 15 mortalities for the 
381 cases is 0.01, again assuming an underlying rate 
of 2.19%. Because patients are not randomized to 
hospitals, patient selection is a real issue — the last 
column of Table 1 illustrates this point. The expected
mortality rate at the 13 institutions, estimated
using $\mu_i = \frac{1}{n_i} \sum_{j=1}^{n_i} E(Y_{ij} \mid \mathbf{x}_{ij}, \boldsymbol{\beta}_1, \mu, \tau^2)$ (Section
3.3), indicates that patients treated at Hospital
3 are relatively sicker, with an expected mortality 
rate of 2.94%, compared to those treated at Hospi- 
tal 9 where the expected mortality rate is less than 
1%. 
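
The binomial probabilities quoted in this paragraph can be checked directly. The sketch below assumes, as the text does, a common underlying mortality rate of 2.19% and ignores case-mix; the helper name is ours.

```python
from math import comb


def binomial_pmf(deaths, cases, rate):
    """Probability of exactly `deaths` deaths among `cases` independent patients."""
    return comb(cases, deaths) * rate**deaths * (1.0 - rate) ** (cases - deaths)


STATE_RATE = 0.0219  # average expected mortality rate assumed in the text

p_no_deaths_hospital_5 = binomial_pmf(0, 26, STATE_RATE)   # 26 cases
p_no_deaths_hospital_9 = binomial_pmf(0, 80, STATE_RATE)   # 80 cases
p_15_deaths_hospital_3 = binomial_pmf(15, 381, STATE_RATE)  # 381 cases
```

Rounded to two decimals, these agree with the values 0.56, 0.17 and 0.01 given above.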

To overcome the statistical shortcomings of the 
HCFA approach, researchers proposed the use of hi- 
erarchical models to describe hospital mortality, 

(4) $Y_{ij} \mid \beta_{0i} \stackrel{\text{ind}}{\sim} \text{Bern}(p_{ij})$, where $\text{logit}(p_{ij}) = \beta_{0i} + \boldsymbol{\beta}_1 \mathbf{x}_{ij}$,

(5) $\beta_{0i} \stackrel{\text{iid}}{\sim} N(\mu, \tau^2)$.

In (5), $\tau^2$ represents between-hospital variation. The
hierarchical model mimics the hypothesis that underlying
quality leads to systematic differences among
true hospital outcomes. If there are no between-hospital
differences in mortality and $\mathbf{x}$ has been defined
appropriately, then $\tau^2 = 0$ and
$\beta_{01} = \beta_{02} = \cdots = \beta_{0I} = \beta_0$. While it is almost certain that $\tau^2 > 0$,
the question is whether $\tau$ is small enough to ignore.
An important feature of the hierarchical model re- 
lates to the multiple comparison issue. Multiplicity 
of parameter estimation is addressed by integrating 
all the parameters into a single model, for example, 
a common distribution for the $\beta_{0i}$'s. Regression to
the mean is naturally accommodated because poste- 
rior estimates of the random intercepts, or of func- 
tions of the random intercepts, are "shrunk" toward 
the mean (Christiansen and Morris, 1997; Normand, 
Glickman and Gatsonis, 1997). 

An implicit assumption in the model defined by 
(4)-(5) is that hospital mortality is independent of 
the number of patients treated at the hospital. While 
some researchers have shown volume to be related 
to mortality, the relationship between institutional 
volume and mortality is relatively weak in the case 
of CABG surgery (Peterson et al., 2004; Shroyer et 
al., 1996; Shahian and Normand, 2003). 

3.3 Case-Mix Adjustments 

The most controversial issue continually raised by 
institutions is the adequacy of risk adjustment. Be- 
cause patients are not randomized to institutions, 
statistical adjustments are used to adjust for ob- 
served imbalances (Harrell, 2001). Adjustments are 
made through regression modeling although recent 
suggestions involve the use of propensity scores 
(Huang et al., 2005). The expected mortality rate 
at an institution is calculated as the number of ex- 
pected deaths divided by the number of patients, 

(6) $\mu_i = \frac{1}{n_i} \sum_{j=1}^{n_i} E(Y_{ij} \mid \mathbf{x}_{ij}, \boldsymbol{\beta}_1, \mu, \tau^2)$.

In addition to the type of analytical adjustment 
used, issues regarding inclusion of types of covari- 
ates are also important. For example, because ad- 
ministrative databases contain diagnoses upon hos- 
pital discharge, only those diagnostic codes that are 
thought to be present on admission are included in 
a risk model. For example, a discharge diagnosis of 
pneumonia, while predictive of mortality, may have 
arisen because of poor quality of care. Adjustment
for such factors could "adjust out" the effect of interest.
Table 2 displays the prevalence, adjusted odds
ratios and corresponding 95% posterior interval estimates
for the risk factors included in the Massachusetts
CABG mortality model. Not surprisingly,
cardiogenic shock and timing of myocardial infarction
are the strongest predictors of 30-day mortality.

Table 2

Mean and adjusted odds ratios of 30-day mortality following
isolated CABG surgery in adults, Massachusetts, 2002

Risk factor                                  Mean    Odds ratio    95% posterior interval
Years > 65*                                   1.5       1.05         1.02, 1.07
Male                                         74.5       0.60         0.39, 0.96
Renal failure                                 7.3       2.39         1.32, 3.93
Diabetes mellitus                            38.0       1.17         0.72, 1.76
Hypertension                                 77.0       2.91         1.35, 6.26
Peripheral vascular disease                  18.0       1.73         1.05, 2.66
Prior percutaneous coronary intervention     18.6       0.87         0.48, 1.44
Cardiogenic shock                             2.2       3.16         1.29, 6.45
Ejection fraction
  > 40                                       75.5       1.00         —
  < 30% or missing                           12.8       1.48         0.79, 2.44
  30-39                                      11.7       1.33         0.68, 2.27
Myocardial infarction
  No myocardial infarction                   51.1       1.00         —
  Within 6 hours                              0.9       9.89         2.44, 26.63
  7-24 hours                                  1.8       3.72         1.15, 8.68
  1-7 days                                   20.7       1.10         0.57, 1.90
  8-21 days                                   5.7       1.45         0.56, 2.96
  > 21 days                                  19.8       1.43         0.72, 2.54
Status of CABG
  Elective                                   34.0       1.00         —
  Urgent                                     62.0       2.55         1.29, 4.81
  Emergent/salvage                            3.0       2.61         0.79, 6.44
Pre-op intra-aortic balloon pump              9.3       2.57         1.40, 4.37

Hierarchical model estimated using 4603 surgeries with 101
deaths.

*Represents the number of years over age 65 at time of
surgery.

3.4 Identifying Underperforming Institutions 

In addition to providing a standardized measure 
of outcome performance, virtually all report cards 
aim to identify institutions that are outliers. The 
key question is "what is an outlier?" The most com- 
mon approach involves estimating an adjusted rate 
and identifying institutions in the tails of the distri- 
bution. Estimation of the adjusted rates, of course, 
involves specifying prior distributions for the hyperparameters.
The sensitivity of posterior inferences to
the choice of the prior distributions is critical. Moreover,
if outlying hospitals are present, their data can
influence the distribution of between-hospital variance,
$\tau^2$. Two alternative approaches to identifying
outliers are to (1) specify the hyperparameters and 
determine which hospitals are "out-of-control" or 
(2) estimate the predicted number of mortalities at 
each hospital through cross-validation and compare 
the predicted number to the observed number. 

3.4.1 Estimating the hyperparameters. The model 
described by (4)-(5) assumes that the random ef- 
fects are completely exchangeable, arising from a 
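normal distribution with mean $\mu$ and variance $\tau^2$.
The risk standardized mortality rate for each Massachusetts
cardiac surgery program, for example, is

(7) $\text{Risk Standardized Mortality} = \left( \frac{\sum_{j=1}^{n_i} \frac{\exp(\beta_{0i} + \boldsymbol{\beta}_1 \mathbf{x}_{ij})}{1 + \exp(\beta_{0i} + \boldsymbol{\beta}_1 \mathbf{x}_{ij})}}{\sum_{j=1}^{n_i} \frac{\exp(\mu + \boldsymbol{\beta}_1 \mathbf{x}_{ij})}{1 + \exp(\mu + \boldsymbol{\beta}_1 \mathbf{x}_{ij})}} \right) \times 2.19$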

and can be easily estimated using Markov chain 
Monte Carlo methods. 

The ratio in the parentheses in (7) is the Bayesian 
version of the "O" to "E" ratio used in earlier ver- 
sions of the HCFA reports. However, the "O" has 
been replaced by a shrinkage estimator that also ad- 
justs for hospital case-mix. This ratio has a causal 
interpretation. Multiplying the ratio by the number 
of cases in the ith hospital and then subtracting the 
number of cases yields the number of excess mortal- 
ities (or if negative, the number of additional sur- 
vivors) if the hospital's distribution of cases across 
risk categories had been what it was, but if its mor- 
tality rates across those risk categories were replaced 
by the state rates. The interested reader should see 
Draper and Gittoes (2004) and references therein 
where a counterfactual framework for estimators like 
that in (7) is discussed. The counterfactual distribu- 
tion used in the Massachusetts report card is deter- 
mined using the mortality risk observed in categories 
of patient types within the state and the prevalence 
of patient types observed within each hospital. 
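
For a single posterior draw of the parameters, the quantity in (7) can be evaluated as in the sketch below. The function name, argument names and example values are hypothetical; in practice the calculation would be repeated over all Markov chain Monte Carlo draws and summarized by a posterior mean and interval.

```python
import math


def inv_logit(value):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-value))


def risk_standardized_rate(beta0_i, mu, beta1, covariates, state_rate=2.19):
    """Evaluate (7) for one hospital and one draw of the parameters.

    beta0_i    : hospital-specific random intercept
    mu         : mean of the random-intercept distribution
    beta1      : coefficient vector for the patient covariates
    covariates : list of covariate vectors for the hospital's patients
    state_rate : unadjusted state mortality rate in percent
    """
    linear = [sum(b * x for b, x in zip(beta1, xs)) for xs in covariates]
    predicted = sum(inv_logit(beta0_i + value) for value in linear)  # numerator
    expected = sum(inv_logit(mu + value) for value in linear)        # denominator
    return predicted / expected * state_rate


# Hypothetical hospital with two patients and one binary risk factor.
patients = [[1.0], [0.0]]
average_hospital = risk_standardized_rate(-4.0, -4.0, [0.5], patients)
worse_hospital = risk_standardized_rate(-3.5, -4.0, [0.5], patients)
```

A hospital whose intercept equals $\mu$ recovers the state rate exactly, and an intercept above $\mu$ gives a rate above it, which is the sense in which the ratio in parentheses plays the role of an "O" to "E" ratio.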

Figure 1 displays the risk standardized 30-day mor- 
tality rates for the 13 Massachusetts cardiac surgery 
programs and corresponding 95% intervals. The estimates
are obtained using a vague proper conjugate
prior distribution for $\tau^{-2}$, $\tau^{-2} \sim \text{Gamma}(0.001, 0.001)$,
a vague but proper prior for $\mu$, $\mu \sim N(0, 1000)$,
and similar independent vague normal priors for the
components of $\boldsymbol{\beta}_1$. A burn-in of 5000 draws is used
and a subsequent 3000 iterations for inference. The
institution random effects are estimated by shrinking
the risk-adjusted rates to $\mu$, where the amount
of shrinkage is measured by the ratio of the within- 
institution variance to the total variance. While none 
of the intervals excludes the state unadjusted rate of 
2.19%, the rate for Hospital 13 is clearly large with 
posterior mean 2.58 [median = 2.37]. 

The interval estimate for Hospital
13 is surprisingly wide given the observed volume of
419 cases. Figure 2 displays the relationship between
the numerator and denominator values simulated
from the posterior distribution for four hospitals
with varying volume: Hospital 5 with 26 cases, Hospital
7 with 718 cases, Hospital 8 with 149 cases, and
Hospital 13 with 419 cases. The severity of the patient
populations can be contrasted across the hospitals
by examining the distribution of the draws on
the x-axis and noting sicker populations are shifted
to the right. The distribution of draws above the
x = y line in the graphs indicates increased binomial
variability with the higher observed mortality
rate at Hospital 13. The predicted probabilities at 
this institution have a skewed distribution. 

Sensitivity of posterior distribution to prior specification.
The sensitivity of posterior inferences to
choice of prior distribution for $\tau^2$ is particularly important
when comparing institutions (Gelman, 2002,
2006). The degree of sensitivity relates to the number
of institutions or the sample size per institution.
Spiegelhalter, Abrams and Myles (2004) have
provided interpretations for plausible values for the
standard deviation, $\tau$, in order to choose a prior.
One interpretation involves specifying a plausible
range for the ratio of the 97.5% odds of mortality
to the 2.5% odds of mortality, say $a$, and then solving
$\exp(3.92\tau) = a$ to determine a value for $\tau$. For
example, when $\tau = 0.1$ the range in odds ratios is
1.48 (the odds of dying at a "high"-mortality hospital
relative to a "low"-mortality hospital), and this
may be viewed as an acceptable range in variability
for the random effects. A second method involves
considering the absolute difference between a random
pair of $\beta_{0i}$'s. The distribution of this difference,
assuming normality for the random effects, is a half-Normal
distribution with median $1.09\tau$. Thus, if a
reasonable upper 95% point for $\tau$, $\tau_{0.95}$, can be specified,
then $\tau \sim \text{half-Normal}((\tau_{0.95}/1.96)^2)$.
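
The arithmetic behind the first interpretation and the half-Normal scale can be made explicit. The function names below are ours and the sketch simply restates the two relationships given in the text.

```python
import math


def odds_ratio_range(tau):
    """Ratio of the 97.5% to the 2.5% odds of mortality implied by tau."""
    return math.exp(3.92 * tau)


def tau_from_odds_ratio_range(a):
    """Solve exp(3.92 * tau) = a for tau."""
    return math.log(a) / 3.92


def half_normal_variance(tau_95):
    """Variance parameter of the half-Normal prior given an upper 95% point."""
    return (tau_95 / 1.96) ** 2


range_at_tau_01 = odds_ratio_range(0.1)
tau_recovered = tau_from_odds_ratio_range(range_at_tau_01)
variance_for_upper_point_1 = half_normal_variance(1.0)
```

With $\tau = 0.1$ the implied range is about 1.48, as stated above, and an upper 95% point of 1 gives a half-Normal variance parameter of about 0.26, the value that appears in the prior labelled half-Normal(0.26) in Table 3.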

Table 3 illustrates the effects of the choice of prior
specification of the variance component on posterior
estimates of $\tau^2$ and $\mu$ for the Massachusetts data. The
Gamma prior tends to place more weight on small
values of $\tau^2$ than the other priors. A uniform prior
for the standard deviation over the range 0 to 1.5 implies
small values as equally likely as large values and
the half-Normal uses $\tau_{0.95} = 1$. Not surprisingly, the
posterior estimates of the between-program variance
are smallest under the Gamma prior and largest under
the half-Normal. Consequently the Gamma prior
shrinks the program random effects toward the overall
mean more than that with the other two prior
distributions.

Fig. 1. Risk standardized mortality rates (%) for 13 cardiac programs. Posterior mean and 95% posterior intervals.

Table 3

Sensitivity of posterior distribution of the hyperparameters to prior specification of
between-institution variance

                       Prior distribution
Posterior summaries    $\tau^{-2} \sim$ Gamma(0.001, 0.001)    $\tau \sim$ Unif(0, 1.5)    $\tau \sim$ half-Normal(0.26)
$\tau^2$
  Mean                 0.04                        0.08                  0.09
  Median               0.02                        0.04                  0.05
  95% PI               (0.0007, 0.24)              (0.0007, 0.39)        (0.00009, 0.38)
$\mu$
  Mean                 -6.75                       -6.74                 -6.79
  Median               -6.72                       -6.23                 -6.78
  95% PI               (-7.79, -5.88)              (-7.97, -5.95)        (-7.68, -5.75)

PI = posterior interval.

Fig. 2. Expected versus predicted mortality rates (in %) for four cardiac surgery programs. Each plot displays the draws from
the posterior distribution used for inference. The x-axis displays expected mortality rate [denominator in (7) divided by $n_i$]
while the y-axis displays the shrinkage estimate [numerator in (7) divided by $n_i$] denoted "predicted" mortality in the figure.
The solid line is the x = y line.

Posterior predictive p-values. A disadvantage of
estimating the hyperparameters relates to estimation
of $\tau^2$. If there is substantial between-institution
variance, the posterior estimate of $\tau^2$ may actually
mask outliers by accommodating a larger (than acceptable)
estimate of between-hospital variance. A
method to help detect this problem involves quan- 
tifying the discrepancy between the data and the 
model through replication (Gelman et al., 2004). For 
institutional profiling this idea is implemented by 
generating data sets with the same number of insti- 
tutions, the same distribution of institution sample 
sizes and the same covariate distributions as those 
observed, and then comparing the observed number 
of mortalities at each institution to the posterior 
predictive distribution of the number of deaths, 

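(8) $\int_{\Omega} \int_{\bar{y}_i^{\mathrm{rep}}} I(\bar{y}_i^{\mathrm{rep}} > \bar{y}_i \mid \text{data})\, f(\bar{y}_i^{\mathrm{rep}} \mid \Omega)\, f(\Omega \mid \text{data})\, d\bar{y}_i^{\mathrm{rep}}\, d\Omega$.

Here $\bar{y}_i^{\mathrm{rep}}$ denotes the mean mortality at institution $i$
in the replicated data set, $\Omega$ denotes the vector of
model parameters $(\boldsymbol{\beta}, \mu, \tau^2)$ and $I(\bar{y}_i^{\mathrm{rep}} > \bar{y}_i \mid \text{data})$ is
an indicator variable assuming a value of 1 if the
replicated mean is larger than the observed mean. If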
the tail-area probability (denoted a posterior predic- 
tive p-value) is extreme, beyond 0.99 or below 0.01, 
and if the difference between the observed and repli- 
cated means is of practical significance, then this 
provides some evidence that the model for Hospital 
13 is questionable. Column 2 in Table 4 lists the posterior
predictive p-values when using a Gamma(0.001,
0.001) prior distribution for $\tau^{-2}$. Like the estimate
of the risk standardized mortality rate, the p-values
cast some suspicion on Hospital 13 (p-value = 0.03).
The mean [95% PI] replicated mortality rate for 
Hospital 13 is 1.99% [0.7%, 3.82%] while the ob- 
served rate is 3.58%, a difference of practical impor- 
tance. 
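
A Monte Carlo approximation to the tail-area probability in (8) can be sketched as follows. This is an illustration under stated assumptions: the function name is ours, and the input is assumed to be a set of posterior draws that have already been converted to per-patient death probabilities for one institution, which the fitted hierarchical model would supply.

```python
import random


def posterior_predictive_pvalue(prob_draws, observed_deaths, seed=0):
    """Approximate the probability that replicated mortality exceeds observed.

    prob_draws      : list of posterior draws; each draw is a list of death
                      probabilities, one per patient at the institution
    observed_deaths : number of deaths actually observed at the institution
    seed            : seed for the random number generator (for repeatability)
    """
    rng = random.Random(seed)
    exceed = 0
    for probs in prob_draws:
        # Replicate one data set: an independent Bernoulli outcome per patient.
        replicated_deaths = sum(rng.random() < p for p in probs)
        # Same number of patients, so comparing counts compares the means.
        if replicated_deaths > observed_deaths:
            exceed += 1
    return exceed / len(prob_draws)


# Degenerate checks with hypothetical inputs.
p_never_exceeds = posterior_predictive_pvalue([[0.0] * 10] * 5, observed_deaths=0)
p_always_exceeds = posterior_predictive_pvalue([[1.0] * 10] * 5, observed_deaths=3)
```

A small value indicates that the replicated data rarely reach the observed mortality, which is the pattern reported for Hospital 13.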

3.4.2 Specifying the hyperparameters. Rather than 
estimating the hyperparameters, acceptable or in- 
control values of the hyperparameters could be pre- 
specified. This approach has the advantage of explic- 
itly stating a standard and does not label a predeter- 
mined number of hospitals as having excess mortal- 
ity. Posterior predictive p-values can be calculated 
in a way similar to that described in (8). Because 
the data contain more information about $\mu$ than $\tau$,
we permit the data to estimate $\mu$. The key issue
in this approach is how to select the values of $\tau$
that suggest "acceptable" variation. To determine 
the values, again, the guidance given by Spiegelhal- 
ter, Abrams and Myles (2004) is helpful. 

Columns (3) and (4) in Table 4 list the posterior
predictive p-values assuming the random effects
arise from a Normal distribution with unknown mean
and two different values for $\tau$. Specifying $\tau^2 = (0.10)^2$
implies that we are willing to accept 95% of the random
effects to lie in a range of 1.5 in the odds ratio
across the 13 cardiac programs. As in the other analyses,
Hospital 13 appears on the boundary with a
p-value of 0.02. The mean replicated observed mortality
rate under this model is 1.96% compared to
the observed value of 3.58%. Using $\tau^2 = (0.01)^2$ indicates
a willingness to tolerate virtually no between-hospital
variation. Under this condition, the model
effectively reduces to a logistic regression model with
known intercept and no dependence on hospital. In
this case (Column 4), the replicated rates should be
close to the observed mean of 2.19% due to very
little shrinkage. Hospital 12, which had an observed
rate of 1.10%, is more than one full percentage point
lower than the mean replicated rate (p-value = 0.01)
when we specify a small value for $\tau^2$. If $\tau^2$ is made
very large, then the model reduces to a logistic regression
model with a fixed parameter for each hospital.

3.4.3 Cross-validation. Another method for iden- 
tifying outlying hospitals is through cross-validation. 
In this approach, each hospital is dropped from the 
analysis, the parameters of the model, $\Omega$, are estimated,
and then the mortality rate at the dropped
institution is predicted by averaging over the poste- 
rior distribution. In a manner similar to the meth- 
ods discussed in Sections 3.4.1 and 3.4.2, a posterior 
predictive p-value can be computed. 

Column (5) of Table 4 presents the cross-validated
p-values when systematically eliminating each hos- 
pital. Hospital 13 is again suspect with a posterior 
predictive p-value of 0.01. The remaining columns 
of Table 4 summarize the posterior estimates of the 
hyperparameters when excluding each hospital. The 
estimate of $\tau^2$ is substantially smaller when Hospital
13 is eliminated from the model, approximately
1/3 smaller than the estimate when this hospital is 
included in the model. 

Other models are available to characterize outly- 
ingness. For example, rather than assuming the in- 
stitutional effects are completely exchangeable, par- 
tial exchangeability could be accommodated through 
a mixture model. 

4. LOOKING AHEAD 

The use of a single response to characterize an 
institution's quality of care, even when confined to 
care for a specific disease, is rather simplistic. In 
fact, under the leadership of the National Commit- 
tee for Quality Assurance (NCQA), the Joint Com- 
mission on Accreditation of Healthcare Organiza- 
tions (JCAHO), and various professional societies, 
consensus has emerged around core sets of process 
and outcomes measures for particular diseases and 
surgical procedures. NCQA, for example, sponsors 
and maintains the Health Plan Employer Data and 
Information Set (HEDIS) that consists of standardized
performance measures and consumers' experiences
for the purposes of comparing managed health
care plans in the United States. The National Quality
Forum has endorsed a set of 21 measures for
assessing institutional CABG surgery quality (see a
subset of measures in Table 5). The Society of Thoracic
Surgeons has recently developed a multidimensional
composite quality measure and rating system
that utilizes structure, process and outcomes measures.

Table 4

Analytical strategies for identifying poorly performing cardiac programs

Columns (2)-(4): Replication, posterior predictive p-values assuming $\beta_{0i} \sim N(\mu, \tau^2)$
Columns (5)-(7): Cross-validation assuming $\beta_{0i} \sim N(\mu, \tau^2)$

Cardiac surgery   $\tau^2$ unknown   $\tau^2 = (0.10)^2$   $\tau^2 = (0.01)^2$   p-value   $\mu_{-i}$   $\tau^2_{-i}$
program (1)       (2)            (3)               (4)               (5)       (6)      (7)
 1                0.35           0.38              0.50              0.40      -6.46    (0.174)^2 [(0.132)^2]
 2                0.53           0.48              0.45              0.49      -6.73    (0.183)^2 [(0.143)^2]
 3                0.13           0.13              0.17              0.12      -6.82    (0.182)^2 [(0.132)^2]
 4                0.26           0.22              0.13              0.22      -7.19    (0.185)^2 [(0.145)^2]
 5                0.27           0.27              0.32              0.26      -6.53    (0.158)^2 [(0.120)^2]
 6                0.38           0.37              0.27              0.30      -6.95    (0.180)^2 [(0.146)^2]
 7                0.30           0.31              0.40              0.33      -6.96    (0.187)^2 [(0.145)^2]
 8                0.36           0.34              0.25              0.35      -6.77    (0.137)^2 [(0.103)^2]
 9                0.49           0.46              0.35              0.47      -6.48    (0.149)^2 [(0.112)^2]
10                0.46           0.42              0.35              0.45      -6.60    (0.199)^2 [(0.152)^2]
11                0.43           0.43              0.48              0.43      -6.57    (0.184)^2 [(0.130)^2]
12                0.20           0.16              0.01              0.17      -6.69    (0.160)^2 [(0.123)^2]
13                0.03           0.02              0.03              0.01      -6.54    (0.127)^2 [(0.100)^2]

All calculations assume $\mu \sim N(0, 1000)$, $\tau^{-2} \sim \text{Gamma}(0.001, 0.001)$ unless specified otherwise.

(2): Posterior probability observed mortality rate is more extreme than replicated mortality rate using all hospitals. Posterior
mean [median] for $\mu$ = -6.75 [-6.72]; $\tau^2$ = 0.042 [0.016].

(3) and (4): Posterior probability observed mortality rate is more extreme than replicated rate using all hospitals and assuming
an in-control prior distribution. Posterior mean (SD) for $\mu$, Column 3: -6.34 (0.392) and Column 4: -6.76 (0.513).

(5): Posterior predictive probability that observed mortality rate is more extreme than the predicted mortality rate. Predictions
use estimates based on all hospitals except $i$.

(6) and (7): Posterior mean average log-odds and mean [median] variance based on all hospitals except $i$.

Interestingly, payors of health care have initiated 
pay-for-performance programs that offer bonus pay- 
ment to hospitals that achieve high performance on 
the core sets (Galvin and Milstein, 2002; Milstein
et al., 2000). Some health plans tier hospitals based 
on value similar to a tiered pharmacy benefit. Pa- 
tients using hospitals classified in the high-value tier 
pay lower coinsurance or copayments at the point 
of care — often a 10% lower copayment (Steinbrook, 
2004). The Ivledicare Modernization Act passed in 
2003 established financial incentives for hospitals to 
provide CIVlS with data on quality indicators. The 
Hospital Quality Incentive Demonstration Project 
was launched in July 2003 to measure quality and 
pay incentives to participating hospitals that achieve 
"superior" levels of quality in five clinical areas 



Table 5

Examples of process-based measures of hospital quality for
patients undergoing isolated CABG surgery

Measure                          Description
-------------------------------  ------------------------------------------
Pre-operative beta-blockade      Percent of patients receiving beta-blockers
                                 within 24 hours preceding surgery
Use of internal mammary artery   Percent of patients receiving an internal
                                 mammary artery graft

Discharge Medications for In-Hospital Survivors
Beta-blockade                    Percent of patients discharged on
                                 beta-blockers
Anti-platelet agent              Percent of patients discharged on
                                 anti-platelet therapy
Anti-lipid treatment             Percent of patients discharged on a statin
                                 or other pharmacologic lipid-lowering
                                 regimen



(www.cms.hhs.gov). In a similar fashion, a consortium
of organizations, including among others CMS,
the Joint Commission on Accreditation of Healthcare
Organizations and the American Hospital Association,
initiated the Hospital Quality Alliance
(www.cms.hhs.gov/HospitalQualityInits/). For
fiscal years 2005 through 2007, eligible acute care
hospitals reporting data to CMS receive an increase
in the annual payment update from CMS for each
of the DRGs under consideration (Jha et al., 2005).
The data reported to CMS are standard clinical per- 
formance measures that have detailed specification, 
such as those listed in Table 5, and typically consist 
of a denominator that reflects the total number of 
patients eligible for a measure and a numerator that 
reflects the total number of eligible patients that re- 
ceived the therapy. 

While a single measure may not be sufficient to de- 
scribe the quality of care for a particular institution 
(Normand et al., 2007), the use of multiple measures 
can be challenging to consumers and to regulators 
(Epstein, 1998). Statistical issues regarding how to 
simultaneously model the multiple responses, how to 
derive a summary or composite measure, and how 
to select superiorly performing institutions arise. 

To facilitate collection of performance measures, 
CMS has defined a set of Healthcare Common Pro- 
cedure Coding System codes, termed G-codes, to 
supplement the usual claims data with clinical data. 
The goal is to use these new codes to define nu- 
merators and denominators for various performance 
measures. Availability of electronic health records 
may facilitate reporting of clinical data even further. 
Administrative data fashioned like California's inpa- 
tient discharge data supplemented with the new G- 
codes should lead to improved yet feasible databases. 

4.1 Multiple Outcomes and Composite Measures 

Let $Y_{ik}$ denote the number of patients receiving
needed therapy $k$ at institution $i$ and let $n_{ik}$
denote the number of patients who should receive
therapy $k$. For example, $n_{ik}$ may denote the number of
patients undergoing CABG surgery who were discharged
alive and $Y_{ik}$ is the number of these patients
who were prescribed anti-lipid therapy.

In its Hospital Quality Incentive Demonstration
Project, CMS calculates

(9)  $y_i = \frac{\sum_k Y_{ik}}{\sum_k n_{ik}}$

for each of $I$ hospitals and identifies hospitals falling
into the 90th percentile of the empirical distribution
of $\{y_1, y_2, \ldots, y_I\}$. While such an approach is easily
understood, it has several drawbacks. The optimal
pooling algorithm depends on the type of measurement
error associated with each measure, the
correlation among the measures on the same individual,
the scales of the measures and the missing
data mechanism (see Horton and Fitzmaurice, 2004 
for a review). A more recent proposal (Nolan and 
Berwick, 2006) advocates an "all-or-none" rule that 
constructs a binary response for each patient: a suc- 
cess is coded if the patient received all the care for 
which the patient was eligible; otherwise a failure is 
coded. This particular method addresses the within- 
patient issues but the variable number of eligible 
measures per patient is ignored and the other issues 
raised earlier remain. 
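
The contrast between a pooled score and the all-or-none rule can be made concrete with a small sketch. It assumes the pooled score takes the form of total therapies delivered divided by total eligibilities, and it operationalizes "90th percentile" as a simple top-10% rank cutoff; both are assumptions made for illustration. The record layout, hospital labels, measure names and all data values are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical records: (hospital, patient_id, measure, received).
# A row exists only when the patient was eligible for the measure.
RECORDS = [
    ("A", 1, "beta_blocker", True), ("A", 1, "anti_platelet", True),
    ("A", 2, "beta_blocker", False), ("A", 2, "anti_platelet", False),
    ("B", 1, "beta_blocker", True), ("B", 1, "anti_platelet", False),
    ("B", 2, "beta_blocker", False), ("B", 2, "anti_platelet", True),
    ("C", 1, "beta_blocker", True),
    ("C", 2, "beta_blocker", True), ("C", 2, "anti_platelet", True),
    ("C", 2, "anti_lipid", False),
]


def pooled_scores(records):
    """Sum of numerators over sum of denominators, per hospital."""
    received = defaultdict(int)
    eligible = defaultdict(int)
    for hospital, _patient, _measure, got_it in records:
        eligible[hospital] += 1
        received[hospital] += int(got_it)
    return {h: received[h] / eligible[h] for h in eligible}


def all_or_none_scores(records):
    """Share of patients who received every therapy they were eligible for."""
    complete = {}
    for hospital, patient, _measure, got_it in records:
        key = (hospital, patient)
        complete[key] = complete.get(key, True) and got_it
    successes = defaultdict(int)
    patients = defaultdict(int)
    for (hospital, _patient), ok in complete.items():
        patients[hospital] += 1
        successes[hospital] += int(ok)
    return {h: successes[h] / patients[h] for h in patients}


def top_decile(scores):
    """Hospitals ranked in the top 10% (at least one), ties included.

    This rank-based cutoff is an assumed stand-in for the percentile rule.
    """
    k = max(1, math.ceil(0.1 * len(scores)))
    threshold = sorted(scores.values(), reverse=True)[k - 1]
    return {h for h, s in scores.items() if s >= threshold}


pooled = pooled_scores(RECORDS)
all_or_none = all_or_none_scores(RECORDS)
leaders = top_decile(pooled)
```

In this toy data, hospitals A and B tie on the pooled score but differ under the all-or-none rule, because B never delivers complete care to any one patient. Hospital C shows the remaining weakness of all-or-none: a patient eligible for one measure counts the same as a patient eligible for three.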

The creation of composite measures to reflect per- 
formance is not new or unique to the assessment of 
health care. Lessons learned from education again 
are useful. The Stanford Achievement Test (SAT), 
first published in 1923, has been used to derive two 
composite scores, one for math and one for verbal 
ability, in order to assess student ability. The National
Assessment of Educational Progress (NAEP) is
a U.S. congressionally mandated national survey to
derive proficiency scores to measure academic performance
of U.S. students. Both the SAT and the
NAEP use hierarchical models adapted from item 
response theory (IRT) to scale responses. 

Using a similar approach, hospital composites could 
also be created. In the case of a collection of binary 
measures, the observed number of patients receiving 
needed therapy k may be thought of as arising from 
a Rasch model (Rasch, 1960), 

(10)  $Y_{ik} \mid \beta, \theta_i \sim \mathrm{Bin}(n_{ik}, p_{ik})$,
      where $\mathrm{logit}(p_{ik}) = \beta_{0k} - \beta\theta_i$
      and $\theta_i \overset{\mathrm{iid}}{\sim} N(0, 1)$,

where $\beta_{0k}$ denotes the difficulty of measure $k$, $\theta_i$
denotes the underlying quality of the institution and
$\beta$ is the precision of $\theta_i$. In (10) higher values of $\theta$
correspond to better quality of care, and would thus
serve as the composite measure of hospital quality.
If this model is commensurate with the data, then
$y_i$ is a minimal sufficient statistic for $\theta_i$ (Skrondal
and Rabe-Hesketh, 2004). However, if the measures
have different abilities to discriminate quality, then
the following model is more reasonable:

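(11)  $Y_{ik} \mid \beta_k, \theta_i \sim \mathrm{Bin}(n_{ik}, p_{ik})$,
      where $\mathrm{logit}(p_{ik}) = \beta_{0k} - \beta_{1k}\theta_i$
      and $\theta_i \overset{\mathrm{iid}}{\sim} N(0, 1)$.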



Fig. 3. One-parameter (Rasch model) and two-parameter models for assessing hospital quality. The x-axis represents
underlying hospital quality; the y-axis represents the probability of providing needed therapy. One-parameter model on left;
two-parameter model on right.



Here $\beta_{1k}$ measures how well the $k$th measure
discriminates between hospitals with different qualities
and is equivalent to a factor loading. The model in
(11) is often referred to as a two-parameter logistic
item response model while that in (10) is denoted a
one-parameter logistic item response model. If (11)
holds, then $y_i$ is not sufficient for $\theta_i$ and financial
incentives rewarded on the basis of the observed
statistic could be distorted.
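
The sufficiency point can be illustrated numerically. With the eligible counts held fixed, the binomial likelihood under (10) depends on a hospital's data only through the unweighted total of therapies delivered, whereas under (11) it depends on the total weighted by each measure's discrimination parameter. The sketch below is a minimal illustration under stated assumptions: the two hospitals, their counts and the discrimination magnitudes are all hypothetical, and no model is actually fitted.

```python
# Hypothetical counts Y_ik of patients receiving therapy k at two hospitals,
# each with n_ik = 100 eligible patients per measure (values are made up).
ELIGIBLE = [100, 100, 100, 100]
COUNTS = {
    "H1": [90, 90, 60, 60],
    "H2": [60, 60, 90, 90],
}

# Hypothetical discrimination magnitudes; in the second set, measures 3 and 4
# discriminate more sharply than measures 1 and 2.
EQUAL_SLOPES = [1.0, 1.0, 1.0, 1.0]
UNEQUAL_SLOPES = [0.5, 0.5, 2.0, 2.0]


def pooled_score(counts, eligible):
    """Total therapies delivered over total eligibilities."""
    return sum(counts) / sum(eligible)


def weighted_total(counts, slopes):
    """Discrimination-weighted count of therapies delivered."""
    return sum(b * y for b, y in zip(slopes, counts))


pooled_by_hospital = {h: pooled_score(y, ELIGIBLE) for h, y in COUNTS.items()}
one_param = {h: weighted_total(y, EQUAL_SLOPES) for h, y in COUNTS.items()}
two_param = {h: weighted_total(y, UNEQUAL_SLOPES) for h, y in COUNTS.items()}
```

Both hypothetical hospitals have the same pooled score and the same unweighted total, so a reward based on the pooled score treats them identically. Once the measures are allowed to discriminate unequally, the weighted totals separate, which is the sense in which a pooled score can misrank hospitals when (11) holds.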

Figure 3 contrasts the relationship between the
probability of providing needed therapy and institutional
quality assuming the model in (10) holds
(left panel) with the relationship when the model in 
(11) holds (right panel). The figures were obtained 
through simulation and correspond to four hypo- 
thetical process-based measures. In the left panel, 
the four measures are equally discriminating of hos- 
pital quality, while the right panel illustrates that 
measures 3 and 4 are more discriminating of qual- 
ity as demonstrated by the sharp changes from low 
to high probabilities of providing needed therapy. 
Landrum, Bronskill and Normand (2000) demonstrated
these modeling issues when comparing hospital
acute myocardial infarction quality.

As in the hierarchical model, the IRT models as- 
sume that the probability of receiving needed ther- 
apy is independent of the number of patients treated 
at the hospital. This assumption may be less be- 
lievable in the context of process-based measures 
than in the context of mortality following CABG 
surgery. Furthermore, the IRT model assumes that 
different response measures within a hospital are 
conditionally independent given underlying hospital 
quality. There are two potential violations to this 
last assumption. First, patients contribute multiple 
measurements to the hospital composite. This issue 
can be easily addressed by adding another level to 
the IRT model, making the unit of observation the 
patient-measure and accounting for within-patient 
correlation. Second, even if the within-patient correlation is
accommodated, there may be item-clustering (Scott
and Ip, 2002; Bradlow, Wainer and Wang, 1999).
This occurs if there are clusters of items or mea- 
sures about a common stimulus used in assessing 
outcomes. While this is unlikely to arise in the con- 
text of process-based measures, it could occur when 
using patient surveys. 

4.2 Multiple Mixed Outcomes 

Efforts are underway to develop measures of hos- 
pital "efficiency" as reflected by costs, and to in- 
clude these measures in pay-for-performance pro- 
grams. Most efficiency measures assess technical ef- 
ficiency defined as the cost of an episode of care 
using the least amount of resources. Because cost ef- 
ficiency ignores information about health outcomes, 
some attempts have been made to examine cost and 
efficiency jointly. Specification of the joint distribu- 
tion in the case of mixed outcomes has been the 
activity of recent methodological developments. If 
the measures are made on different scales but quan- 
tify the same underlying construct, then latent vari- 
able models (Sammel, Ryan and Legler, 1997) can 
be used to jointly model the observed outcomes. La- 
tent variable models have also been extended to ac- 
commodate clustered outcomes (Dunson, 2000; Lee 
and Shi, 2001; Landrum, Normand and Rosenheck, 
2003), although there has not been much practical 
experience. However, much less methodology and 
experience are available for modeling longitudinal 
mixed clustered outcomes (Daniels and Normand, 
2006). 



4.3 Concluding Remarks 

While comparative profiling of health care insti- 
tutions has been ongoing for more than a century, 
it has only been in the last decade and a half that 
statisticians have become actively involved. In this 
article, we reviewed the clinical considerations and 
implications of cardiac surgery profiling, and impor- 
tantly, several methodological issues. It is intuitively 
obvious that there will be some between-institution 
variability. The conceptual issue relates to how much 
variability is acceptable, and how to quantify what 
is meant by overperforming and underperforming
institutions. As profiling becomes increasingly linked
to financial incentives, it is likely that the analyti- 
cal methods for classifying institutions will be more 
closely scrutinized. 

We concentrated on the analytical aspects of out- 
comes profiling, where design is an especially impor- 
tant consideration. Determination of power is par- 
ticularly challenging in this area. Policy-makers will 
always be faced with low-volume providers and the
difficulty of determining how best to characterize 
quality for providers with such small numbers. This 
problem can be particularly acute when comparing 
quality at the level of the individual physician (Lan- 
don et al., 2003). 

Finally, we did not focus on the vexing issue of 
provider selection bias — unmeasured risk factors that 
confound case-mix with institutional quality. With 
more interest in causal inference, development of 
sensitivity analyses or instrumental variable anal- 
yses that can handle large numbers of "treatments" 
that characterize the profiling problem (13 in our
cardiac surgery program example) will be critical.